In this study, we take a Bayesian approach to exploring the factors that influence customer decisions to churn (cancel services) in a hypothetical telecommunications company. Our primary aim is to construct a model that accurately identifies the most relevant predictors of a customer’s decision to cancel their service.
To achieve this, we develop and refine the model in stages, starting with a base model as our initial attempt at understanding churn behavior. Recognizing the potential for improvement, we then incorporate informative priors through a LASSO approach, aiming to enhance the model’s performance and focus on the most significant predictors. Finally, we streamline the model by removing less informative variables, yielding a more efficient and interpretable model.
Throughout this process, we emphasize the importance of rigorous diagnostics and adjustments. By systematically analyzing and enhancing our model, we aim to strike a balance between complexity and interpretability, ensuring our model is both accurate and insightful. The outcome will be a comprehensive Bayesian analysis, shedding light on the intricacies of customer churn in the telecom sector.
We begin by installing and loading the necessary R packages. These packages will provide us with tools for Bayesian analysis, data manipulation, and visualization.
options(repos = c(CRAN = "https://cran.rstudio.com/"))
# Installing necessary libraries
# install.packages("rstan")
# install.packages("modeldata")
# install.packages("ggplot2")
# install.packages("heatmaply")
# install.packages("loo")
# install.packages("bayesplot")
# Loading the libraries
library(rstan)
library(modeldata)
library(ggplot2)
library(heatmaply)
library(loo)
library(bayesplot)
The dataset utilized in this analysis, known as “Customer Churn Data,” is obtained from the “modeldata” package. Although these data are artificial, they have been constructed to mirror patterns and characteristics found in real-world business scenarios, specifically within the telecommunications sector. This alignment with real-world scenarios ensures that our analysis and conclusions are grounded in practical and applicable contexts.
In our pursuit of a simplified and focused analysis, we elected to exclude the variable related to the voice mail plan. Additionally, to streamline our data, we consolidated variables related to daily usage periods (morning, afternoon, and night) into a single representative variable for total daily usage.
Key variables in our analysis include:

- international_plan: A binary variable (factor), indicating the presence (“yes”) or absence (“no”) of an international plan.
- total_minutes (total_day_minutes + total_eve_minutes + total_night_minutes): A numeric variable representing the sum of minutes used during the day, evening, and night.
- total_calls (total_day_calls + total_eve_calls + total_night_calls): An integer variable indicating the total number of calls made during the day, evening, and night.
- total_charge (total_day_charge + total_eve_charge + total_night_charge): A numeric variable summarizing the charges for service use throughout the day, evening, and night.
- total_intl_minutes: A numeric variable for the total minutes in international calls.
- total_intl_calls: An integer variable for the total number of international calls.
- total_intl_charge: A numeric variable for the total charges for international calls.
- number_customer_service_calls: An integer variable indicating the number of calls to customer service.
- churn: A binary variable (factor), signifying service cancellation (“yes”) or continuation (“no”).
# Load the Customer churn data
data("mlc_churn", package = "modeldata")
# Adjust 'yes' and 'no' to 1 and 0 in necessary columns
mlc_churn$churn <- as.integer(mlc_churn$churn == "yes")
mlc_churn$international_plan <- as.integer(mlc_churn$international_plan == "yes")
# Select and transform desired columns
mlc_filtered <- mlc_churn[, c(
  "international_plan",
  "total_day_minutes",
  "total_day_calls",
  "total_day_charge",
  "total_eve_minutes",
  "total_eve_calls",
  "total_eve_charge",
  "total_night_minutes",
  "total_night_calls",
  "total_night_charge",
  "total_intl_minutes",
  "total_intl_calls",
  "total_intl_charge",
  "number_customer_service_calls",
  "churn"
)]
# Create aggregated features
mlc_filtered$total_calls <- rowSums(mlc_filtered[, c("total_day_calls", "total_eve_calls", "total_night_calls")])
mlc_filtered$total_minutes <- rowSums(mlc_filtered[, c("total_day_minutes", "total_eve_minutes", "total_night_minutes")])
mlc_filtered$total_charge <- rowSums(mlc_filtered[, c("total_day_charge", "total_eve_charge", "total_night_charge")])
# Remove original columns
mlc_filtered <- mlc_filtered[, !(colnames(mlc_filtered) %in% c(
  "total_day_calls", "total_day_minutes", "total_day_charge",
  "total_eve_calls", "total_eve_minutes", "total_eve_charge",
  "total_night_calls", "total_night_minutes", "total_night_charge"
))]
Before fitting the model, we select and order the desired columns in the dataset. We also scale the predictors so they share a common scale, which both improves the sampler’s performance and makes the coefficients directly comparable.
# Select and order desired columns for the model
cols_to_plot <- c(
  "international_plan",
  "total_charge",
  "total_minutes",
  "total_calls",
  "total_intl_charge",
  "total_intl_minutes",
  "total_intl_calls",
  "number_customer_service_calls",
  "churn"
)
# Calculate the number of predictors
k <- length(cols_to_plot) - 1
# Select columns and scale the predictors
mlc_filtered <- mlc_filtered[, cols_to_plot]
mlc_filtered[, -ncol(mlc_filtered)] <- scale(mlc_filtered[, -ncol(mlc_filtered)])
We will now define our Bayesian model using Stan, a powerful tool for statistical modeling and high-performance statistical computation. The model aims to understand the factors influencing customer churn.
The first Stan model is specified as follows:
The data block declares the number of observations N, the number of predictors K, the matrix of predictors x, and the binary response variable y. The parameters block defines the intercept alpha and the coefficients beta. The model block specifies the likelihood of y as a Bernoulli distribution with a logit link function, without implementing LASSO regularization. This model serves as a baseline for comparison with more complex models, such as those including LASSO regularization.
stan_code <- "
data {
  int<lower=0> N;                    // number of observations
  int<lower=0> K;                    // number of predictors
  matrix[N, K] x;                    // predictor matrix
  array[N] int<lower=0, upper=1> y;  // binary response
}
parameters {
  real alpha;                        // intercept
  vector[K] beta;                    // coefficients
}
model {
  // Likelihood only: implicit flat priors on alpha and beta
  y ~ bernoulli_logit(x * beta + alpha);
}
generated quantities {
  vector[N] log_lik;
  for (n in 1:N) {
    log_lik[n] = bernoulli_logit_lpmf(y[n] | x[n] * beta + alpha);
  }
}
"
With the Stan model defined, the next step is to execute the model using the stan function from the rstan package. This involves setting up the data for the model, fitting the model, and extracting the posterior distributions of the model parameters.
First, we prepare the data in the format required by the Stan model. This includes the number of observations, the number of predictors, the predictor matrix, and the response variable.
# Prepare data for Stan model
stan_data <- list(
  N = nrow(mlc_filtered),
  x = as.matrix(mlc_filtered[, -ncol(mlc_filtered)]),
  y = mlc_filtered$churn,
  K = k # Number of predictors
)
Now we fit the model using the stan function. This process involves specifying the model code, data, number of chains, iterations, and cores for parallel computation.
num_iter <- 1000 # Number of iterations
start_time <- Sys.time()
stan_fit <- stan(model_code = stan_code, data = stan_data, chains = 2, iter = num_iter, cores = 4)
## Warning: There were 939 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
end_time <- Sys.time()
execution_time <- end_time - start_time
After fitting the Stan model, performing a comprehensive diagnostic analysis is essential to assess the quality of the fit and the reliability of the model. This analysis involves several key metrics and tests.
The Effective Sample Size (ESS) provides an estimate of the number of independent samples equivalent to the correlated samples drawn from the Markov Chain. A higher ESS indicates better precision and reliability of the estimates, suggesting that the sampling process is adequately exploring the posterior distribution.
The Rhat statistic is a measure of convergence. It compares the variance within each chain to the variance between chains. Values of Rhat close to 1 (typically less than 1.1) indicate that all chains have converged to the same distribution, ensuring that our inferences are based on a stable posterior distribution.
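These two checks can be automated. The helper below is a sketch (the function name flag_convergence and the 1.1 / 100 thresholds are our own choices, reflecting common rules of thumb rather than any package default): it scans the summary matrix that rstan produces and returns the names of parameters that look suspect.

```r
# Sketch: flag parameters whose Rhat or effective sample size fails
# common rules of thumb (Rhat > 1.1, ESS below some floor).
flag_convergence <- function(fit_summary, rhat_max = 1.1, ess_min = 100) {
  bad <- fit_summary[, "Rhat"] > rhat_max | fit_summary[, "n_eff"] < ess_min
  rownames(fit_summary)[bad]
}
# With a fitted model: flag_convergence(summary(stan_fit)$summary)
```

An empty result suggests the chains have mixed adequately; any names returned deserve a closer look at their traceplots.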
LOO Cross-Validation is used to estimate the out-of-sample predictive accuracy of the model. It involves removing one observation at a time and predicting its value using the model fitted on the remaining data. Key metrics from the LOO analysis include the elpd_loo (expected log pointwise predictive density), which quantifies the model’s predictive accuracy, p_loo, which estimates the effective number of parameters, and looic (LOO information criterion), a measure for model comparison.
HMC Diagnostics are crucial for assessing the performance of the sampling algorithm. We examine the number of divergent iterations and the number of iterations that hit the maximum tree depth. A high number of divergent iterations may indicate issues with the model such as extreme curvature. Similarly, hitting the maximum tree depth frequently can suggest that the algorithm is struggling to explore the posterior effectively.
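rstan bundles these sampler checks into convenience functions, which can be queried directly on a fitted model; a minimal sketch (run after a fit such as stan_fit exists):

```r
# One-call report covering divergences, treedepth saturation, and E-BFMI
check_hmc_diagnostics(stan_fit)
# The individual counts, if numbers are needed programmatically
get_num_divergent(stan_fit)
get_num_max_treedepth(stan_fit)
```

These are the same quantities we fold into the custom summary function defined next.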
We combine these diagnostics into a comprehensive summary, using custom functions to extract and collate key metrics from the model. This summary includes the average ESS and Rhat across all parameters, alongside the LOO and HMC diagnostics, providing a holistic view of the model’s performance and reliability.
create_parameter_names <- function(k) {
  pars <- character(k + 1)
  for (i in 1:k) {
    pars[i] <- paste("beta[", i, "]", sep = "")
  }
  pars[k + 1] <- "alpha"
  return(pars)
}
pars <- create_parameter_names(k)
extract_diagnostics <- function(stan_fit, pars) {
  # Extract diagnostics from the Stan model summary
  summary_stan <- summary(stan_fit)$summary
  params_mean_sd <- summary_stan[pars, c("mean", "sd")]
  params_ess <- summary_stan[pars, "n_eff"]
  params_rhat <- summary_stan[pars, "Rhat"]
  params_diagnostics <- data.frame(params_mean_sd, ESS = params_ess, Rhat = params_rhat)
  # Return the diagnostics
  return(params_diagnostics)
}
model_performance_summary <- function(stan_fit, pars, time, name) {
  # Calculating LOO results
  loo_result <- loo(stan_fit)
  # Extracting key LOO metrics from the 'estimates' matrix
  # (direct '$' access to these fields is deprecated in recent 'loo' releases)
  elpd_loo <- loo_result$estimates["elpd_loo", "Estimate"]
  p_loo <- loo_result$estimates["p_loo", "Estimate"]
  looic <- loo_result$estimates["looic", "Estimate"]
  # Checking HMC diagnostics
  divergences <- get_num_divergent(stan_fit)
  max_treedepth <- get_num_max_treedepth(stan_fit)
  stan_summary <- summary(stan_fit)$summary
  stan_summary <- stan_summary[pars, c("n_eff", "Rhat")]
  mean_ess <- mean(stan_summary[, "n_eff"])
  mean_rhat <- mean(stan_summary[, "Rhat"])
  # Creating a dataframe with the results
  performance_summary_df <- data.frame(
    name = name,
    time = time,
    divergences = divergences,
    max_treedepth = max_treedepth,
    elpd_loo = elpd_loo,
    p_loo = p_loo,
    looic = looic,
    mean_ess = mean_ess,
    mean_rhat = mean_rhat
  )
  return(performance_summary_df)
}
performance_summary_df <- model_performance_summary(stan_fit, pars, execution_time, "Base Model")
diagnostics <- extract_diagnostics(stan_fit, pars)
print(diagnostics)
## mean sd ESS Rhat
## beta[1] 0.594250844 0.03421001 565.4487 1.002312
## beta[2] 0.789001517 0.10519753 320.2255 1.005396
## beta[3] 0.033299029 0.10416333 358.7173 1.004245
## beta[4] -0.008492584 0.04566378 638.1076 0.998047
## beta[5] -1.239934489 12.46973366 309.8864 1.003457
## beta[6] 1.476949266 12.47033849 309.8147 1.003486
## beta[7] -0.171247765 0.04927416 768.2224 1.000535
## beta[8] 0.665656446 0.04318015 546.8007 1.006796
## alpha -2.287116879 0.05570167 457.7478 1.002231
print(performance_summary_df)
## name time divergences max_treedepth elpd_loo p_loo
## 1 Base Model 9.044425 mins 0 939 -1638.618 11.01692
## looic mean_ess mean_rhat
## 1 3277.235 474.9968 1.002945
Based on the diagnostics and analysis of the initial model, we decide to incorporate a LASSO prior (Laplace prior) for the coefficients. This approach can help in feature selection and potentially improve model performance by imposing shrinkage on the coefficients, which is particularly useful in models with many predictors.
The Stan model is now updated to include a Laplace prior for each coefficient, controlled by the lambda parameter. This prior encourages sparsity in the coefficients, which can be beneficial in selecting the most relevant features.
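The intuition behind this prior can be seen in its density. The sketch below (the helper laplace_density is our own, not a package function) compares the Laplace density with a normal of the same scale: the Laplace puts a sharp peak at zero, which is what pulls weak coefficients toward zero. Note that in Stan’s double_exponential(0, lambda), lambda is a scale parameter, so smaller values imply stronger shrinkage.

```r
# Laplace (double-exponential) density, f(x) = exp(-|x| / b) / (2b)
laplace_density <- function(x, b) exp(-abs(x) / b) / (2 * b)

x <- seq(-3, 3, length.out = 201)
plot(x, laplace_density(x, 1), type = "l", ylab = "density",
     main = "Laplace vs. normal prior")
lines(x, dnorm(x, 0, 1), lty = 2)  # normal with the same scale, for contrast
```

The solid curve’s spike at zero (0.50 versus roughly 0.40 for the normal) is the visual signature of LASSO-style shrinkage.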
stan_code <- "
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] x;
  array[N] int<lower=0, upper=1> y;
  real<lower=0> lambda; // Shrinkage parameter for LASSO
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  for (k in 1:K) {
    beta[k] ~ double_exponential(0, lambda); // Laplace prior (double_exponential)
  }
  y ~ bernoulli_logit(x * beta + alpha);
}
generated quantities {
  vector[N] log_lik;
  for (n in 1:N) {
    log_lik[n] = bernoulli_logit_lpmf(y[n] | x[n] * beta + alpha);
  }
}
"
We now fit the updated Stan model with the LASSO prior. This process involves the same steps as the initial model but with the updated stan_code.
stan_data <- list(
  N = nrow(mlc_filtered),
  x = as.matrix(mlc_filtered[, -ncol(mlc_filtered)]),
  y = mlc_filtered$churn,
  lambda = 1,
  K = k
)
start_time <- Sys.time()
stan_fit_lasso <- stan(model_code = stan_code, data = stan_data, chains = 2, iter = num_iter, cores = 4)
end_time <- Sys.time()
execution_time <- end_time - start_time
After fitting the Stan model with the LASSO prior, our next step is to extract and analyze its diagnostics. We aim to compare these diagnostics with those from the initial model to assess any improvements in model performance. Specifically, we’ll look at key metrics like the mean, standard deviation, effective sample size (ESS), and Rhat for each parameter. Improvements in these diagnostics can indicate a better-fitting and more reliable model.
To facilitate this comparison, we will use the custom functions create_parameter_names and extract_diagnostics. create_parameter_names generates the vector of parameter names, and extract_diagnostics retrieves important diagnostic metrics. By examining these diagnostics side-by-side with those from the initial model, we can gauge the effectiveness of the LASSO prior in enhancing our model.
pars <- create_parameter_names(k)
diagnostics_lasso <- extract_diagnostics(stan_fit_lasso, pars)
performance_summary_lasso <- model_performance_summary(stan_fit_lasso, pars, execution_time, "Lasso Model")
combined_performance_summary <- rbind(performance_summary_df, performance_summary_lasso)
# Compare performance summary
print(combined_performance_summary)
## name time divergences max_treedepth elpd_loo p_loo
## 1 Base Model 9.044425 mins 0 939 -1638.618 11.016918
## 2 Lasso Model 1.922301 mins 0 0 -1637.283 9.666711
## looic mean_ess mean_rhat
## 1 3277.235 474.9968 1.0029450
## 2 3274.566 666.4488 0.9998915
We now proceed to a graphical analysis of the model parameters to uncover potential correlations and insights.
First, we visualize the traceplot, histogram, autocorrelation, and comparative distribution areas of the parameters. These visualizations are crucial for assessing convergence, distribution characteristics, and potential autocorrelations between samples.
# Traceplot for convergence and mixing
traceplot(stan_fit_lasso, pars = pars)
# Histogram of posterior distributions
mcmc_hist(stan_fit_lasso, pars = pars)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Autocorrelation function for samples
mcmc_acf(stan_fit_lasso, pars = pars)
# Comparative visualization of posterior distributions
mcmc_areas(stan_fit_lasso, pars = pars)
Next, we examine pairs plots for selected parameters to investigate potential correlations. From our previous analysis, we identified that the coefficients beta[3] and beta[6] might be considered irrelevant: their standard deviations exceed their absolute mean values, indicating substantial uncertainty about even the sign of their effects.

In light of this finding, we decide to simplify our model by removing these parameters. Specifically, we will remove the variables corresponding to total minutes and total intl minutes. This simplification aims to improve the interpretability of the results without compromising the model’s predictive capability. Removing variables with poorly identified parameters often leads to a more robust and interpretable model.
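This screening rule can also be applied mechanically. The helper below is a sketch (the name weak_parameters is ours): given a diagnostics table like the ones produced by extract_diagnostics above, it lists the parameters whose posterior standard deviation exceeds the absolute posterior mean.

```r
# Sketch: parameters whose posterior sd swamps the posterior mean,
# i.e. candidates for removal under the criterion described above
weak_parameters <- function(diag_df) {
  rownames(diag_df)[abs(diag_df$mean) < diag_df$sd]
}
# e.g. weak_parameters(diagnostics_lasso)
```

Such a filter is only a screening device; any candidate it returns should still be checked against the pairs plots and subject-matter knowledge before removal.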
# Pairs plot for beta[2] and beta[3]
mcmc_pairs(stan_fit_lasso, pars = c("beta[2]", "beta[3]"))
# Pairs plot for beta[5] and beta[6]
mcmc_pairs(stan_fit_lasso, pars = c("beta[5]", "beta[6]"))
After deciding to remove certain parameters, we will now refit the Stan model with a simplified set of predictors. This involves using the already scaled data with the selected columns.
First, we adjust our dataset to include only the columns deemed most relevant, based on our previous analysis.
# Select and order desired columns for the model
cols_to_plot <- c(
  "international_plan",
  "total_charge",
  "total_calls",
  "total_intl_charge",
  "total_intl_calls",
  "number_customer_service_calls",
  "churn"
)
# Update the dataset with selected columns
mlc_filtered <- mlc_filtered[, cols_to_plot]
# Calculate the number of predictors
k <- length(cols_to_plot) - 1
With the revised dataset, we now refit our Stan model. This step involves updating the data input for the model and executing the fitting process again.
# Prepare data for Stan model
stan_data <- list(
  N = nrow(mlc_filtered),
  x = as.matrix(mlc_filtered[, -ncol(mlc_filtered)]),
  y = mlc_filtered$churn,
  lambda = 1, # Shrinkage parameter for the LASSO prior
  K = k # Number of predictors
)
# Fit the model with the updated data
start_time <- Sys.time()
stan_fit_updated <- stan(model_code = stan_code, data = stan_data, chains = 2, iter = num_iter, cores = 4)
end_time <- Sys.time()
execution_time <- end_time - start_time
After a thorough process of modeling and refinement, we have arrived at our Final Model. This model has been streamlined to focus on the most informative parameters for predicting customer churn.
We extract the posterior distributions of the model parameters and perform diagnostic checks to ensure the model’s reliability.
# Extract posterior distributions for the final model
posterior <- extract(stan_fit_updated)
posterior_alpha <- posterior$alpha
posterior_beta <- posterior$beta
# Generate parameter names for the final model
pars <- create_parameter_names(k)
# Extract diagnostics and summary for the final model
diagnostics_final <- extract_diagnostics(stan_fit_updated, pars)
performance_summary_final <- model_performance_summary(stan_fit_updated, pars, execution_time, "Final Model")
combined_performance_summary <- rbind(combined_performance_summary, performance_summary_final)
We compare the diagnostics across different iterations of the model to assess improvements and the impact of our refinements.
print(diagnostics)
## mean sd ESS Rhat
## beta[1] 0.594250844 0.03421001 565.4487 1.002312
## beta[2] 0.789001517 0.10519753 320.2255 1.005396
## beta[3] 0.033299029 0.10416333 358.7173 1.004245
## beta[4] -0.008492584 0.04566378 638.1076 0.998047
## beta[5] -1.239934489 12.46973366 309.8864 1.003457
## beta[6] 1.476949266 12.47033849 309.8147 1.003486
## beta[7] -0.171247765 0.04927416 768.2224 1.000535
## beta[8] 0.665656446 0.04318015 546.8007 1.006796
## alpha -2.287116879 0.05570167 457.7478 1.002231
print(diagnostics_lasso)
## mean sd ESS Rhat
## beta[1] 0.593623629 0.03396851 1035.5786 0.9984698
## beta[2] 0.787025203 0.10299320 336.2980 1.0001961
## beta[3] 0.031420157 0.09796483 387.1637 1.0005968
## beta[4] -0.006669103 0.04370173 901.0240 0.9995705
## beta[5] 0.110165566 0.67475132 394.8947 1.0003335
## beta[6] 0.125605381 0.67560043 397.5920 1.0003491
## beta[7] -0.170671845 0.05013646 849.4523 1.0014535
## beta[8] 0.660916985 0.04306591 884.1528 0.9994730
## alpha -2.285226446 0.05717407 811.8827 0.9985815
print(diagnostics_final)
## mean sd ESS Rhat
## beta[1] 0.593322877 0.03285741 1562.547 1.0003486
## beta[2] 0.814018778 0.05056466 1243.412 1.0002211
## beta[3] -0.006872559 0.04329517 1915.266 0.9983268
## beta[4] 0.233257109 0.04743143 1356.929 0.9993787
## beta[5] -0.169223270 0.04671793 1608.443 1.0004451
## beta[6] 0.661826535 0.04171933 1087.169 0.9994565
## alpha -2.284415862 0.05830173 1224.729 0.9983084
print(combined_performance_summary)
## name time divergences max_treedepth elpd_loo p_loo
## 1 Base Model 9.0444250 mins 0 939 -1638.618 11.016918
## 2 Lasso Model 1.9223013 mins 0 0 -1637.283 9.666711
## 3 Final Model 0.3184474 mins 0 0 -1636.044 8.315238
## looic mean_ess mean_rhat
## 1 3277.235 474.9968 1.0029450
## 2 3274.566 666.4488 0.9998915
## 3 3272.087 1428.3563 0.9994979
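As a cross-check on the looic column above, the loo package also offers loo_compare, which reports pairwise differences in expected log predictive density together with their standard errors. A sketch (this re-runs the LOO computation on the three stored fits, so it takes a moment):

```r
# Pairwise elpd comparison of the three fitted models
loo_compare(loo(stan_fit), loo(stan_fit_lasso), loo(stan_fit_updated))
```

A difference smaller than a couple of standard errors suggests the models are practically indistinguishable in predictive terms, which is useful context when choosing the simpler one.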
Visualizations provide further insights into the model’s performance and parameter interactions.
# Traceplot for convergence and mixing
traceplot(stan_fit_updated, pars = pars)
# Histogram of posterior distributions
mcmc_hist(stan_fit_updated, pars = pars)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Autocorrelation function for samples
mcmc_acf(stan_fit_updated, pars = pars)
# Comparative visualization of posterior distributions
mcmc_areas(stan_fit_updated, pars = pars)
# Parallel coordinates plot for posterior matrix
posterior_matrix <- as.matrix(data.frame(alpha = posterior$alpha, beta = posterior$beta))
mcmc_parcoord(posterior_matrix)
# Calculate and visualize the correlation matrix
correlation_matrix <- cor(posterior_matrix)
heatmaply(correlation_matrix, width = 800, height = 800)
Our Final Model has successfully pinpointed the key parameters for predicting customer churn in a hypothetical telecommunications company. The most informative predictors identified are the possession of an international plan, the total charges for service use, and the frequency of calls to customer service.
Our statistical analysis highlighted that the coefficients for beta[3] (Total Minutes) and beta[6] (Total International Minutes) were less relevant, marked by substantial uncertainty: their standard deviations exceeded their absolute mean values. Consequently, these variables were excluded, simplifying the model without compromising its predictive capability.
Ultimately, the Final Model’s enhanced performance was evident in its reduced computational time, lower LOO information criterion (looic), and increased Effective Sample Size (ESS) for the beta parameters. These improvements collectively demonstrate the model’s robustness and reliability in predicting churn.
This analysis reaffirms the effectiveness of Bayesian modeling in extracting meaningful insights from data, providing a solid foundation for developing targeted strategies to reduce customer churn in the telecom sector.